Abstract. An approach to the classification problem of machine learning, based 
on building local classification rules, is developed. The local rules are considered 
as projections of the global classification rules to the event we want to classify. A 
massive global optimization algorithm is used to optimize the quality criterion. 
The algorithm, which has polynomial complexity in the typical case, is used to find 
all high-quality local rules. Other distinctive features of the algorithm are the 
integration of level selection (for ordered attributes) with the rule search 
and an original strategy for resolving conflicting rules. The algorithm is practical; it was 
tested on a number of data sets from the UCI repository, and a comparison with 
other prediction techniques is presented. 
1. Introduction 



Extraction of structural information from raw data is a problem which 
is of great interest for both fundamental and applied studies. This paper 
will focus on one specific example of this problem — classification. 
The goal is to predict a class of a particular event. This problem was 
approached from a number of different disciplines, including Statistical 
Data Analysis (Dobson, 1990; Limnios and Nikulin, 2000), Machine 
Learning (Carbonell et al, 1983; Shavlik and Dietterich, 1990; Aha, 
1997; Mitchell, 1990), Fuzzy Logic (Hellendoorn and Driankov, 1997), 
Operations Research (Walker, 1999) and Data Mining (Pitaetsky and 
Frawley, 1991; Fayyad et al, 1996). As a result, a variety of learning 
techniques has been developed. The result of learning can be represented in 
a number of different forms. The form that we are interested in working 
with is a set of rules. It should be stressed that some other forms (such 
as decision trees, fuzzy models and many others) are equivalent to a 
set of rules. 

A set of rules (or any other form to which it is equivalent) is often 
a preferred form of knowledge representation because it allows for a 
simple answer to the question, "What was learned?" This specific set 
of rules was learned from the data. For an algorithm that produces 
only an answer, it is often impossible to understand what was really 
learned and why this specific answer was produced. (The two mentioned 
knowledge representations differ as follows: in the case that the result 
is a rule, the learned knowledge is represented in a language which is 
richer than one used to describe the dataset; in the case that the result 
is a value, the learned knowledge is represented in the same language 
as the one used to describe the dataset (Quinlan, 1993).) 

The model-based techniques, such as those developed in (Quinlan, 1992; 
Riddle et al, 1994), take training data as input and produce a set of 
rules (or statements which are equivalent to rules) which can classify 
any event. The lazy instance-based techniques, such as those developed in 
(Aha et al, 1991; Aha, 1997; Quinlan, 1993), return a result tailored to 
the specific event we want to classify. With such techniques the events 
similar to the given one are usually found first, then a prediction based 
on the found instances is made. An interesting attempt to combine 
model-based and lazy instance-based learning was presented in (Melli, 1998), 
where a greedy lazy model-based approach for classification 
was developed in which the result was a rule tailored to the specific 
observation. While such an approach gives a simple rule as an answer 
(which is often much easier to understand than a complex rule set) 
and often works faster for classification of a single event, it is, like every 
greedy algorithm, not guaranteed to find the best rule, because the 
algorithm may not reach the global maximum of the quality criterion 
and a sub-optimal rule may be returned. 

In the work (Riddle et al, 1994) an approach based on the brute 
force of rule-space scanning was developed. It was used for finding the 
"nuggets" of knowledge in the data (each nugget is a rule with a high 
degree of correctness). In contrast with greedy type algorithms, massive 
search algorithms are guaranteed to find the best rule(s). 

In our early work (Malyshkin et al, 1999) we presented an approach 
which combined the massive model-based rule search approach with 
lazy instance-based learning. In that work we were also interested in 
"nuggets" of knowledge, but only those which were applicable for the 
instance we wanted to classify. The result was a set of rules which were 
applicable for classification of the given event. One may think about 
these rules as a projection of a global classification rules set to the given 
instance of the event. 
In the current paper this approach is taken to the next level, and a 
practical algorithm, applicable to a variety of problems, is presented. 
A number of significant improvements have been made since that early 
version. The current algorithm includes the following new features: 
1. highly optimized rule-space scanning, which allows problems with 
a significant number of attributes to be solved; 2. integration of the level 
selection procedure for ordered (continuous and literal) attributes with 
the rule search algorithm; 3. information about dependent attributes 
directly included into the tree search algorithm, thus significantly reducing 
computational complexity; and 4. an original conflicting rules resolution 
strategy which was especially built to work with automatically 
generated rules. 

To create a practical algorithm, three aspects need to be addressed: logical, 
statistical, and computational complexity. In section 
2 we formulate the problem and discuss the logical formulas which 
represent the rules we are interested in finding. In section 3 we discuss 
the statistical quality criterion which can be used for evaluation of 
rule quality and specify the criteria which we use in this work. We 
also present a conflicting rules resolution strategy for automatically 
generated rules. At the end of section 3 a sketch of the algorithm 
is presented. In section 4 we discuss the selection of attributes for 
analysis; it should be stressed that some attributes as they are built 
in section 4 are not independent, and this fact is known in advance. 
In section 5 we discuss computational complexity issues; an approach 
which includes information about dependence of the attributes into 
the algorithm is proposed. In section 6 we discuss error estimation. In 
section 7 we present the data analysis results and compare our results 
with the results of C4.5R8 (Quinlan, 1992). In section 8 a discussion is 
presented. 



2. Logical formulas as a result of statistical analysis 

In this section we describe logical formulas obtained as a result of 
data analysis. The representation of knowledge learned 
from the data can vary depending on the approach used. However, 
different forms of knowledge representation (decision tables, decision 
trees, rule lists, etc.) are equivalent to some logical formulas. Formulas 
obtained during data analysis are usually quite complex when applied 
to prediction or classification. This complicates the understanding of 
the results. The major source of complexity is the fact that the formulas 
are usually built to be applicable to all data observations. As we show 
below, the complexity of the rules can be significantly reduced if, instead 
of building global rules, we build local rules which are defined on a 
subset of observation data; this subset must include the data point 
where we want to perform a prediction/classification. Such an approach 
combines the best of both instance-based and model-based learning. 
It can be described as an approach working with projections of global 
formulas to local observations. A drawback of such an approach is the 
need to recalculate the rules for every event we want to classify. This 
is the cost of using simple local rules instead of complex global rules. 

In the simplest form the problem can be represented as follows: 
We have a random variable g (consequent) and a random vector x 
(antecedent) of M components $x^{(m)}$, m = 1...M. Random variables 
g and $x^{(m)}$ are assumed to take two different values: true and false. 
(Note that this does not limit us in using other types of input data. 
The detailed process of g and x selection will be described in section 
4.) We have a finite number of observations N + 1; each observation 
gives specific values of $x_n^{(m)}$ and $g_n$. Index n = 0...N enumerates the 
observations. The value of antecedent $x_n^{(m)}$ is known for n = 0...N, the 
value of consequent $g_n$ is known for n = 1...N; at the point n = 0 the 
value of consequent is unknown. The problem is predicting the value of 
g at n = 0. Again, we are interested in finding a prediction of g only at 
one point n = 0, not in building a universal prediction formula which is 
applicable at any n. This allows us to build a prediction which is easier 
to build, understand, and interpret. 

The prediction is represented as a set of conjunctive forms which are 
correct with a high degree of confidence; this set of conjunctives may 
be considered as a distinctive conjunctive form of a logical formula. 
The criterion of acceptance/rejection will be described in section 3. 
Consider all possible expressions of the form: 

$$ f = \bigwedge_{m \in \{\mu\}} \left( x_n^{(m)} = x_0^{(m)} \right) \qquad (1) $$

In Eq. (1) each term is a match of the antecedent component $x_n^{(m)}$ 
at a given point n with the value $x_0^{(m)}$ at the point where we want to 
make a prediction, n = 0; index m belongs to a given set of indexes 
$\{\mu\}$; we have a logical "and" between all these terms, i.e. formula 
(1) represents the fact of simultaneous matches of several antecedent 
components (those with indexes in the $\{\mu\}$ set) with their values at the 
prediction point n = 0. Each formula of type (1) is completely defined 
by a set $\{\mu\}$. In total there are $2^M$ possible $\{\mu\}$ sets. 
The goal is to find the conjunctions of form (1) which give an 
implication with a high degree of confidence: 

$$ \left( \bigwedge_{m \in \{\mu\}} \left( x_n^{(m)} = x_0^{(m)} \right) \right) \rightarrow \left( g = g_0^{(pr)} \right) \qquad (2) $$

Formula (2) represents an implication rule in which a simultaneous match 
of the given antecedent components (those in the $\{\mu\}$ set) with their values at 
the point to predict, n = 0, gives a specific value of the consequent. The value 
$g_0^{(pr)}$ is the value that rule (2) predicts. Note that a rule of form (2) 
is defined on a subset of all available observations (the observations on 
which (1) is true). We do not consider rules (even if they have very 
high confidence) which cannot be applied at n = 0. This drastically 
reduces the number of rules we may accept. 
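
As an illustration of formulas (1) and (2), the following minimal Python sketch marks the observations on which a given index set matches the point to predict. The function names, the tuple-of-Booleans data layout, and the toy data are assumptions made for this example; they are not taken from the original implementation.

```python
from itertools import combinations


def matches(x_n, x_0, mu):
    """Formula (1): True when x_n agrees with x_0 on every index in mu."""
    return all(x_n[m] == x_0[m] for m in mu)


def all_index_sets(M):
    """Enumerate all 2**M index sets over components 0..M-1 (empty set included)."""
    for size in range(M + 1):
        yield from combinations(range(M), size)


# Hypothetical toy data: x_0 is the antecedent at the point to predict (n = 0),
# observations holds the antecedents of the points with known consequent.
x_0 = (True, False, True)
observations = [
    (True, False, False),
    (True, True, True),
    (True, False, True),
]

mu = (0, 1)  # match on the first two components only
support = [n for n, x_n in enumerate(observations) if matches(x_n, x_0, mu)]
print(support)                         # indexes of observations where (1) is true
print(len(list(all_index_sets(3))))    # number of candidate index sets for M = 3
```

Only the observations listed in `support` take part in evaluating the corresponding rule (2).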

In the next section we discuss statistical criteria used for the evalu- 
ation of each rule quality and for resolving the problem of conflicting 
rules (when several high confidence rules predict different values of g). 



3. Local prediction rules: statistical evaluation of quality 
and conflicts resolution 

Quality evaluation of a rule is based upon its statistical characteris- 
tics. In this paper we use canonical statistics: statistics which can be 
expressed via components of matrix of joint distribution (f,g): 

$$ \begin{pmatrix} P(f = false, g = false) & P(f = false, g = true) \\ P(f = true, g = false) & P(f = true, g = true) \end{pmatrix} \qquad (3) $$

Here g is the consequent and f is a logical formula, for example one 
from Eq. (1); (3) is a 2x2 matrix (because both f and g take two 
different values). Probability P can be defined in a number of different 
ways. In this paper the probability is defined in the standard combinatorial 
way (the number of favorable outcomes divided by the total number of 
outcomes). Almost any of the commonly used criteria of (coverage, correctness) 
type can be expressed via the components of matrix (3). 

There are many different statistics which can be used for quality 
evaluation of a logical formula. In the work (Hajek and Havranek, 
1978) an approach of logical formulas transformation was developed 
which may solve an exponential complexity combinatoric problem in 
polynomial time. A similar approach was developed in (Lyashenko, 
1989), where only statistics allowing formula transformations increasing 
quality criterion were used. 
A statistic commonly used as a quality criterion is information gain 
(Shannon, 1948). The information gain based criterion was used in a 
number of machine learning studies. This criterion usually works well 
for the evaluation of global rules, but much less effectively for local 
rules. In the case of local rules the major problem with the information gain 
criterion is the fact that $f \rightarrow g$ and $\neg f \rightarrow \neg g$ are equally important for 
this criterion. For rules of type (2) we know in advance that f = 
true, and this asymmetry should be included in the quality criterion. 
The information gain criterion works well in the case of a rare event. For 
example, for an event which happens in 1 out of 100 cases, a rule which 
predicts that the event will never happen has 0.99 correctness. At the 
same time, an information gain based criterion gives no value to such 
a rule because we get no extra information beyond what we already 
know. 
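
The rare-event example can be checked numerically. The sketch below computes information gain as the mutual information between f and g from the joint distribution matrix (3); equating information gain with mutual information, the name `information_gain`, and the `joint[f][g]` layout (index 0 for false, 1 for true) are assumptions of this illustration.

```python
from math import log2


def information_gain(joint):
    """Mutual information (in bits) between f and g.

    joint[i][j] is P(f = i, g = j) with index 0 = false and 1 = true,
    i.e. the 2x2 matrix of Eq. (3).
    """
    p_f = [sum(row) for row in joint]          # marginal distribution of f
    p_g = [sum(col) for col in zip(*joint)]    # marginal distribution of g
    gain = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                          # 0 * log(0) is taken as 0
                gain += p * log2(p / (p_f[i] * p_g[j]))
    return gain


# Rule "the event never happens": f is true on every observation and the
# rule predicts g = false; the event itself occurs in 1 case out of 100.
never_happens = [[0.0, 0.0],
                 [0.99, 0.01]]

correctness = never_happens[1][0] / sum(never_happens[1])
print(correctness)                       # share of correct "never" predictions
print(information_gain(never_happens))   # information carried by the rule
```

The rule is almost always correct, yet it carries no information about g, which is the behaviour described in the paragraph above.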

The most widely used statistics for estimation of a logical formula 
quality are ones of (coverage, correctness) type; the coverage is defined 
as $P(f = true; g = g_0^{(pr)}) / P(g = g_0^{(pr)})$, and the correctness is defined as 
$P(f = true; g = g_0^{(pr)}) / P(f = true)$. In (Riddle et al, 1994) a criterion 
based on high correctness (with coverage considered secondary) has 
been used. A criterion based on the F-measure (which combines preci- 
sion and recall into one number) from information retrieval theory (van 
Rijsbergen, 1979) can also be used as a quality criterion. An important 
characteristic of the F-measure is the presence of a parameter allowing 
the adjustment of relative importance of coverage and correctness. 

In this paper we use a quality criterion which has properties similar 
to those of (coverage, correctness) type. The quality $\alpha$ of an implication rule 
is defined as follows: 

$$ \alpha = \lambda \frac{P(f = true; g = g_0^{(pr)})}{P(f = true)} + (1 - \lambda) \frac{P(f = true; g = g_0^{(pr)})}{P(g = g_0^{(pr)})} \qquad (4) $$

$$ g_0^{(pr)} : \ P(f = true; g = g_0^{(pr)}) \ \text{is maximal} \qquad (5) $$

In this paper we focus on predicting the events, not the probabilities, 
so for a given f we first select the value (of two possible values) of $g_0^{(pr)}$ 
which gives the maximum of $P(f = true; g = g_0^{(pr)})$, Eq. (5), then evaluate 
the quality of the implication rule using quality criterion (4). The value of 
$\alpha$ is equal to 1 for implications (2) giving totally correct predictions 
for every observation. For implications with non-perfect correctness 
and/or coverage the value of $\alpha$ is lower than 1. The parameter 
$0 < \lambda < 1$ determines the relative importance of coverage and correctness. 
The value $\lambda = 0.5$ makes coverage and correctness equally important 
characteristics of a rule. The values $\lambda > 0.5$ make correctness more 
important than coverage. 
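
A minimal Python sketch of the quality criterion, assuming it is the weighted sum of correctness and coverage with weight λ on correctness, and assuming probabilities are estimated as simple frequency ratios. The function name `rule_quality`, the list-of-Booleans inputs, and the tie-breaking rule are illustrative assumptions.

```python
def rule_quality(f, g, lam=0.5):
    """Return (alpha, predicted value) for one implication rule.

    f[n] tells whether formula (1) is true on observation n,
    g[n] is the known consequent of observation n.
    """
    n_f = sum(f)                       # observations where f is true
    if n_f == 0:
        return 0.0, None               # the rule applies to no observation
    n_true = sum(1 for fi, gi in zip(f, g) if fi and gi)
    n_false = n_f - n_true
    # Eq. (5): predicted value with maximal P(f = true; g = value).
    # Ties are resolved in favour of True (an arbitrary choice of this sketch).
    predicted = n_true >= n_false
    joint = max(n_true, n_false)
    n_g = sum(1 for gi in g if gi == predicted)
    correctness = joint / n_f          # P(f = true; g = value) / P(f = true)
    coverage = joint / n_g             # P(f = true; g = value) / P(g = value)
    alpha = lam * correctness + (1 - lam) * coverage
    return alpha, predicted


f = [True, True, True, False, False, False]
g = [True, True, False, True, False, False]
print(rule_quality(f, g))              # imperfect rule, alpha below 1
print(rule_quality(g, g))              # formula identical to g, alpha equal to 1
```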

While different statistics give very similar results on data which does 
not produce conflicting rules, the difference between different statis- 
tics may become significant when analyzing data producing conflicting 
rules. Our experiments with different types of data have shown that 
quality criterion (4) works well for the different data that we tested. To 
resolve the problem of conflicting rules we separate the process of making 
a prediction into two steps. In the first step we do not predict the specific 
value of g; we just find all implication rules of high quality. In the 
second step we use all found implication rules to obtain a prediction. Let 
us assume we found all rules of high enough quality; for example, with 
a quality better than a given acceptance level $\alpha_0$. Each rule predicts its 
own $g_0^{(pr)}$ at n = 0. If we have no conflicting rules (all accepted rules 
predict the same value of $g_0^{(pr)}$) everything is very simple: this value 
is the value we predict at n = 0. If we have conflicting rules (rules 
which predict different values of g), the situation is more complicated, 
and a conflict resolution strategy must be developed. This is a special 
problem which has been considered in a number of publications. (See 
Refs. (Brownstown et al, 1985; Durkin, 1994; Lucas and Van Der Gaag, 
1991) for review.) Most studies focus on resolving conflicts between 
hand-crafted, rather than automatically generated rules. The conflict 
resolution of automatically generated rules has its own specifics. The 
simplest approach is to accept only one (the best) rule. The problem is 
the fact that it is common to have a number of rules of similar quality, 
and the idea of taking a single rule and leaving a number of rules of 
similar quality out of consideration often causes a significant bias in 
data analysis. An approach often used to resolve such conflicts is the 
idea of ordering rules, but it gives away an extremely useful property 
of rules-based predictions — the ability to evaluate rules in arbitrary 
order. 

The approach we use in this paper differs from the ones mentioned 
above in a very significant way. We assume that all accepted rules 
must be incorporated into the prediction formula. If we do not have 
conflicting rules prediction quality usually increases by combining all 
rules. If we do have conflicting rules, prediction quality may decrease 
(often in a very significant way) when the rules are combined. 

For resolving the problem of conflicting rules consider the following 
problem: Let s be the set of observations on which the value of f from 
(1) is true. Then P(s) is the probability that an observation gives a true 
value of f, and $P(g = g_0^{(pr)} \mid s)$ is the probability that an observation 
has g equal to $g_0^{(pr)}$ under the condition that the observation belongs 
to s. Note that these two probabilities are just equal to P(f = true) 
and $P(g = g_0^{(pr)} \mid f = true)$ respectively, but for conflict resolution it 
is much more convenient to work with sets of observations than with 
individual rules. The problem of resolving conflicting rules is equivalent 
to the following: For a number of sets $s_q$, q = 1...Q, determine the 
probabilities of different outcomes of g under the condition that all $s_q$ 
are true. For a single rule (Q = 1) the answer is trivial: this is either 
$P(g = g_0^{(pr)} \mid s)$ or $P(g = g_0^{(pr)})$ depending on whether we accepted 
or rejected the rule. For more than one rule (Q > 1), a formal answer 
can also be written: this is either $P(g = g_0^{(pr)} \mid s_1 \cap s_2 \cap \ldots \cap s_Q)$ or 
$P(g = g_0^{(pr)})$ depending on whether we accepted the rules or not. The 
problem is that the probability $P(g = g_0^{(pr)} \mid s_1 \cap s_2 \cap \ldots \cap s_Q)$ often cannot 
even be estimated because the set $s_1 \cap s_2 \cap \ldots \cap s_Q$ often has too few 
observations for probability calculation. Here is an example 
of this: Assume we have 100 observations of g and 101 observations of 
$x^{(m)}$, m = 1...2; at the point to predict the antecedent is x = (true, true). 
Let g take the value true on 50 observations and false on the other 
50. Suppose we have two implication rules $(x^{(1)} = true) \rightarrow (g = false)$ 
and $(x^{(2)} = true) \rightarrow (g = true)$; both give perfect prediction (correctness 
and coverage are equal to 1) on these 100 observations. What 
will be the probability of different values of g at the point to predict 
x = (true, true)? We have two perfect rules. The first one predicts 
g = false, and the second one predicts g = true. The probability 
$P(g = g_0^{(pr)} \mid s(x^{(1)} = true) \cap s(x^{(2)} = true))$ cannot be calculated 
because we have no observation with known g when $x^{(1)}$ = true and 
$x^{(2)}$ = true simultaneously. 

To resolve such conflicts we build a set S from all $s_q$ sets and then 
apply a quality criterion to a single "combined" rule which is defined on 
S. This way the problem of conflicting rules is resolved by introducing a 
new, "combined" rule, and the answer is the same as the one mentioned 
above for a single rule: The probability is either $P(g = g_0^{(pr)} \mid S)$ or 
$P(g = g_0^{(pr)})$ depending on whether we accepted or rejected the combined 
rule. Having only one rule we may use a number of different criteria 
to evaluate the quality of this "combined" rule; for example, in addition to 
criterion (4) we may use the $\chi^2$ criterion or any other criterion. Different 
criteria usually give similar results in the case of a single rule (because 
adjustment of the acceptance level does not affect how many rules will 
be accepted/rejected: we have only one rule to consider). It should 
be stressed here that the quality of the combined rule may be lower than 
individual rule quality. If this happens, it often indicates the presence 
of rule conflicts or data overfitting. 

The only problem left to discuss is how to obtain the set S from 
the individual sets $s_q$. There is no universal way to do this, because the 
sets $s_q$ are sensitive to the quality criteria. 

The simplest way is to choose the set S as the union of all $s_q$: 

$$ S = s_1 \cup s_2 \cup \ldots \cup s_Q \qquad (6) $$

Returning to the simple example above with two perfect conflicting rules: 
the set $S = s(x^{(1)} = true) \cup s(x^{(2)} = true)$ covers all 100 observations 
and the criterion (4) produces the value 0.5 (0.5 correctness with 1.0 
coverage), which is a very low value. The "combined" rule must be 
rejected and the unconditional probabilities $P(g = g_0^{(pr)})$ should be used 
for prediction. This is what we intuitively expect in such an extreme 
case of conflicting rules. There are several other ways to select the set 
of observations S. We will not discuss all the variants here. Selecting 
S in the form (6) seems to work best for the quality criterion 
(4). In addition, selecting S as in (6) is well protected against 
data overfitting, because overfitted rules often produce different values 
of g, which drastically reduces the combined rule quality. 
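
The two-rule example can be reproduced with Python sets. This is only a sketch: the way the 100 observations are laid out below (50 with x = (true, false), g = false and 50 with x = (false, true), g = true) is one hypothetical data set consistent with the description, and the variable names are illustrative.

```python
# Each observation is (x, g) with x = (x1, x2); hypothetical data set in which
# both rules (x1 = true) -> (g = false) and (x2 = true) -> (g = true) are perfect.
observations = [((True, False), False)] * 50 + [((False, True), True)] * 50

s1 = {n for n, (x, g) in enumerate(observations) if x[0]}   # support of rule 1
s2 = {n for n, (x, g) in enumerate(observations) if x[1]}   # support of rule 2

intersection = s1 & s2      # observations matching x = (true, true)
union = s1 | s2             # set S of the combined rule, union of the supports

g_true = sum(1 for n in union if observations[n][1])
correctness = max(g_true, len(union) - g_true) / len(union)
coverage = max(g_true, len(union) - g_true) / max(
    sum(1 for _, g in observations if g),
    sum(1 for _, g in observations if not g),
)

print(len(intersection))    # no data to estimate the conditional probability
print(len(union))           # the combined rule covers every observation
print(correctness, coverage)
```

The empty intersection shows why the conditional probability on the intersection cannot be estimated, while the union gives a combined rule whose correctness is no better than a coin flip.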

Let us return to the original problem we formulated in the beginning 
of section 2. Now we can present an algorithm for predicting the value 
of consequent at n = 0. 

1. Select acceptance level $\alpha_0$. 

2. Initialize set S to an empty set. 

3. For every set of antecedent indexes $\{\mu\}$ (in total there are $2^M$ sets) 
do: 

a) Build implication (2) and evaluate its quality $\alpha$. 

b) If $\alpha > \alpha_0$, add all observation points for which f from Eq. (1) is 
true to the set S. 

4. Evaluate the quality of the "combined" rule: the rule which is 
defined on observations from S. This can be done by using the same 
criterion (4), $\chi^2$, or any other type of criterion. If the combined 
rule is accepted use $P(g = g_0^{(pr)} \mid S)$; if rejected use the $P(g = g_0^{(pr)})$ 
probability to predict the fact of g taking the value $g_0^{(pr)}$ at n = 0. 
The predicted value $g_0^{(pr)}$ corresponds to the event with maximal 
probability. 
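
The four steps can be sketched in Python as a plain brute-force loop. This is an illustrative sketch only: it assumes criterion (4) is the λ-weighted sum of correctness and coverage, uses the same criterion for the combined rule, breaks ties in favour of true, and the names `quality` and `predict` as well as the default values of `alpha0` and `lam` are hypothetical.

```python
from itertools import combinations


def quality(mask, g, lam):
    """Criterion (4) on the observations selected by mask; returns (alpha, value)."""
    selected = [gi for hit, gi in zip(mask, g) if hit]
    if not selected:
        return 0.0, None
    n_true = sum(selected)
    value = 2 * n_true >= len(selected)        # Eq. (5); ties go to True
    joint = n_true if value else len(selected) - n_true
    n_value = sum(1 for gi in g if gi == value)
    alpha = lam * joint / len(selected) + (1 - lam) * joint / n_value
    return alpha, value


def predict(X, g, x0, alpha0=0.8, lam=0.5):
    """X, g: the N observations with known consequent; x0: antecedent at n = 0.

    Returns (predicted value of g, estimated probability that g is true).
    """
    M = len(x0)
    S = set()                                              # step 2
    for size in range(M + 1):                              # step 3: all 2**M sets
        for mu in combinations(range(M), size):
            mask = [all(x[m] == x0[m] for m in mu) for x in X]
            alpha, _ = quality(mask, g, lam)               # step 3a
            if alpha > alpha0:                             # step 3b
                S.update(n for n, hit in enumerate(mask) if hit)
    combined = [n in S for n in range(len(X))]
    alpha, _ = quality(combined, g, lam)                   # step 4
    if S and alpha > alpha0:
        p_true = sum(g[n] for n in S) / len(S)             # P(g = true | S)
    else:
        p_true = sum(g) / len(g)                           # unconditional P(g = true)
    return p_true >= 0.5, p_true


# Toy data in which g simply repeats the first antecedent component.
X = [(True, True), (True, False), (False, True), (False, False)] * 5
g = [x[0] for x in X]
print(predict(X, g, (True, False)))
```

On data with two perfect but conflicting rules the combined rule in this sketch falls below the acceptance level, so the unconditional probability is returned, in line with the discussion of the example above.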

The algorithm described above is of exponential complexity (one needs 
to check $2^M$ possible implication rules). As we will show in section 5 the 



artl.tex; 1/02/2008; 21:29; p. 9 






Vladislav G. Malyshkin et al 



complexity may be significantly reduced in an average case. Before we 
start discussing computational complexity let us discuss the procedure 
of attributes selection for antecedent and consequent. 



4. Selection of attributes for analysis 

In all of the considerations above, we always assumed that the consequent 
g and the antecedent components $x^{(m)}$ are Boolean attributes. There are 
many cases in which the data contain attributes of other types. In 
addition to Boolean variables, in this paper we consider continuous 
variables (variables taking values from an interval) and discrete (literal) 
variables (variables taking values from a finite set of possible values). 
The requirement of ordering (so we can compare the values which 
the variable takes) is very important for analysis, because this allows 
us to build an effective algorithm of level selection. The case with 
non-ordered values is much less interesting, because in this case for a 
discrete variable the algorithm described above will use the following 
Boolean attribute: whether the value of the attribute is equal to its 
value at n = 0 or not. 

Let us consider a variable (continuous or discrete) $r_n$ (index n = 
0...N enumerates the observations) taking values from some ordered 
set (for example an interval). We convert $r_n$ to a number of Boolean 
attributes which will be used as the components of vector x. This 
transformation is performed by selecting a grid $y_l$, l = 1...L, and 
comparing the value of r with the levels $y_l$, which gives the antecedent components 
$x_n^{(m(l))} = (r_n < y_l)$. The question is how to select the levels $y_l$ to use in the 
implication. The most commonly used approach is to take a single 
level. This is usually done because an increase in the number of levels 
increases the number of antecedent components, which can drastically 
increase computational complexity. The most common criterion used for 
selection of the split level is the information gain criterion. In several works 
(Dougherty et al, 1995; Quinlan, 1996) this criterion was successfully 
applied for determination of levels of comparison. 

We propose a new approach for antecedent attributes selection. The 
major new characteristics of proposed approach is integration of two 
usually independent steps into one step, so the inference algorithm 
described in section 3 will perform not only data analysis, but will also 
select levels to compare. 

We do not limit ourselves to one or two levels that we can compare 
with; we use a number of levels (the value of L can be chosen pretty 
high) and determine the real levels to use directly during data analysis. 
One may think about this as automatic selection of levels in singleton 
Mamdani rules in fuzzy logic, see (Mamdani and Assilian, 1975). 

The first step is to take an ordered ($y_l < y_{l+1}$) grid $y_l$, l = 1...L, 
which has many different levels. (The levels $y_l$ may be selected as 
all possible values of r or by using any of the supervised or unsupervised 
discretization techniques (Dougherty et al, 1995). These levels 
are only "initial" levels. The inference algorithm will select from these 
the "real" levels which will be used in implication rules.) Then we 
obtain L antecedent components $x^{(m(l))} = (r < y_l)$. These attributes are 
not independent. From the ordering of $y_l$ it follows that if $x^{(m(l))}$ is 
true then $x^{(m(p))}$ is also true for p > l. Also, if $x^{(m(l))}$ is false then $x^{(m(p))}$ is also 
false for p < l. 

The second step is to find the highest index l for which $r_0 < y_l$ is false 
($r_0$ is the value of r at the point to predict, n = 0), and mark this index 
as h. Then $y_l$ with l = h, h - 1, h - 2, ..., 1 may be considered as lower 
boundaries of r, and $y_l$ with l = h+1, h+2, ..., L may be considered as 
upper boundaries of r. These upper and lower boundaries can be 
considered as fuzzy levels for r. Instead of determining specific values for 
the upper/lower levels from some ad hoc special procedure, we select them 
during data analysis by using the inference algorithm we described in 
the previous section. Such integration allows us to automatically select 
the best level for a rule. While it may look like we have increased the 
number of antecedent components and exponentially increased 
computational complexity, this is not really the case. The difference between 
the standard approach (Witten and Frank, 1999, p. 246), in which a k-valued 
variable is replaced by k - 1 synthetic Boolean variables, and our 
approach is that we incorporate the knowledge about the dependence of 
these k - 1 variables into the inference algorithm. In section 5 we show 
that this knowledge can drastically reduce computational complexity 
in the average case. 
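
The two steps can be sketched in Python as follows. The grid values, the function names, and the use of `bisect_right` to locate h are choices made for this illustration only.

```python
from bisect import bisect_right


def boolean_components(r, levels):
    """Antecedent components (r < y_l) for an ordered grid of levels y_1 < ... < y_L."""
    return [r < y for y in levels]


def split_index(r0, levels):
    """Highest 1-based index h for which (r0 < y_h) is false; 0 if there is none.

    (r0 < y) is false exactly when y <= r0, so h is the number of levels
    that do not exceed r0.
    """
    return bisect_right(levels, r0)


levels = [1.0, 2.0, 3.0, 4.0]     # hypothetical initial grid
r0 = 2.5                          # value of r at the point to predict

components = boolean_components(r0, levels)
h = split_index(r0, levels)
lower_boundaries = levels[:h]     # y_l with l = 1..h
upper_boundaries = levels[h:]     # y_l with l = h+1..L

print(components)                 # false up to index h, true afterwards
print(h, lower_boundaries, upper_boundaries)
```

Because the grid is ordered, the component list is always a run of false values followed by a run of true values, which is the dependence between attributes that the inference algorithm exploits.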

The problem of consequent variable selection is usually more straight- 
forward than that for antecedents. If consequent j is a Boolean (literal 
variable with two values) nothing special should be done about con- 
sequent selection and we use j as consequent g. If j is an ordered 
(continuous or discrete) variable then we take a grid $y_l$, l = 1...L, 
$y_{l+1} > y_l$, and just run the analysis for every $g = (j < y_l)$. Additional 
testing on monotonic increase of the predicted probability of true value 
of g with increase of l may be performed to test the consistency of 
the predictor. The algorithm in section 3 can also be applied to g 
taking more than two values, because the quality criterion (4) may 
be generalized to such g. 
5. Estimation of computational complexity for brute force rules analysis 



As we showed in section 3, a brute force algorithm is of exponential 
complexity (it requires the evaluation of $2^M$ rules). However, the implication rules 
we consider are not independent. It is often possible to determine from 
one rule's characteristics that a set of rules does not have a member of 
the required quality, so that set of rules can be taken out of consideration. 
The requirement of preventing data overfitting also helps because it 
eliminates rules that are too complex. In addition some optimization 
techniques can be applied. This way we can often perform brute force 
analysis for a problem with a significant number of components. 

Let us discuss the properties which allow us to reduce computational 
complexity. 

1. Preventing data overfitting. This usually requires taking overly 
complex rules out of consideration. We do this by considering only 
rules with less than $M_{max}$ terms ($M_{max} < M$) in implication (2). 
This immediately reduces the number of rules to consider from $2^M$ 
to $C_M^0 + C_M^1 + \ldots + C_M^{M_{max}} \approx M^{M_{max}}$, which is still too high. 

2. Taking into account dependent antecedent components. In this 
paper we consider the simplest case: upper and lower boundary 
antecedent variables as we build them in section 4. 

Antecedent attributes as we build them in section 4 from variable 
r (which takes values in some ordered set) are not independent. 
For example, the components of the lower boundary set $r_n < y_l$ with l = 
h, h - 1, h - 2, ..., 1 have the following property: 

$$ \left( x_n^{(m(l_1))} = x_0^{(m(l_1))} \right) \& \left( x_n^{(m(l_2))} = x_0^{(m(l_2))} \right) = \left( x_n^{(m(l_3))} = x_0^{(m(l_3))} \right) \qquad (7) $$

$$ l_3 = \max(l_1, l_2) \qquad (8) $$

The property (7) follows from the ordering of $y_l$, the way 
h is selected, which leads to $x_0^{(m(l_1))} = x_0^{(m(l_2))} = false$, and the 
following equation: 

$$ (r < a) \& (r < b) = (r < \min(a, b)) \qquad (9) $$

An equation very similar to (7) can also be written for the upper boundary 
set $r_n < y_l$ with l = h+1, h+2, ..., L. This means that only one 
attribute from the upper (lower) boundary set needs to be included 
in implication (2). If we put two components from the upper (lower) 
boundary set of attributes then, by applying a transformation of type (9), 
we can always replace the two terms by a single one. This 
property, which is known directly from the antecedent, allows us to 
reduce the number of implications we need to consider. In terms of 
computational complexity, adding one set with $n_d$ dependent 
antecedent components selected as described above is 
equivalent to adding far fewer (about 
$\log_2(n_d + 1)$) independent components. This is why we can integrate 
the selection of fuzzy levels with the inference algorithm without much 
increase in computational complexity. Addition of L levels to test is 
equivalent to adding about $\log_2(h + 1) + \log_2(L - h + 1)$ independent 
Boolean attributes. 

3. If an implication rule of (2) form has a perfect (or close to perfect) 
correctness, then the quality of this rule cannot be improved by 
adding more elements to set $\{\mu\}$ (see (Riddle et al, 1994)), because 
by adding more conditions we just decrease coverage while correct- 
ness cannot be further improved. This means we do not need to 
consider the subsets of rules with close to perfect correctness. 

4. As shown in (Riddle et al, 1994), an implication rule
(and all rules which include it) with coverage below some level
cannot produce a rule of the required quality. This requirement
can be slightly improved by using a minimal probability $p^{(v)}$ for
every consequent value ($v \in \{\mathrm{true}, \mathrm{false}\}$ is one
of the two possible consequent values). Specifically, for at least one
v we must have $P((g = v) \,\&\, (f = \mathrm{true})) > p^{(v)}$. If there
is no v for which this condition holds, then the implication (and all
rules which include it) cannot produce a rule of the required quality.
For the quality criterion (4) the value of $p^{(v)}$ can be easily
obtained:

$$ p^{(v)} = \frac{(a_0 - A)\, P(g = v)}{1 - A} \qquad (10) $$
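The coverage test of item 4 can be sketched in Python as follows. The
function and argument names are hypothetical; only formula (10) and the
"at least one v" condition are taken from the text, with a_0 the
acceptance level and A the parameter of criterion (4).

```python
def min_joint_probability(a0, A, p_g_v):
    """Threshold p^(v) from Eq. (10): (a0 - A) * P(g = v) / (1 - A)."""
    if not 0.0 <= A < 1.0:
        raise ValueError("A must lie in [0, 1)")
    return (a0 - A) * p_g_v / (1.0 - A)

def can_reach_quality(p_joint, p_g, a0, A):
    """True if P((g = v) & (f = true)) > p^(v) for at least one v.

    p_joint[v] is the joint probability P((g = v) & (f = true)),
    p_g[v] is the unconditional probability P(g = v)."""
    return any(
        p_joint[v] > min_joint_probability(a0, A, p_g[v])
        for v in (True, False)
    )

# Example: a0 = 0.77, A = 0.75, P(g = true) = 0.4, P(g = false) = 0.6.
p_g = {True: 0.4, False: 0.6}
print(min_joint_probability(0.77, 0.75, p_g[True]))   # about 0.032
print(can_reach_quality({True: 0.05, False: 0.01}, p_g, 0.77, 0.75))  # True
print(can_reach_quality({True: 0.01, False: 0.01}, p_g, 0.77, 0.75))  # False
```

A branch for which `can_reach_quality` returns False can be dropped
together with all rules that extend it.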

5. An implication rule must not have redundant conditions. An
extreme example of a redundant condition is a situation when a term
$X_n^{(l)} = X_0^{(l)}$ is added to implication (2) twice. This does not
change any property of the rule, it just increases its complexity.
To check a rule with m conjunctions for redundancy we may
compare the rule with the m rules obtained by taking out one condition
from the original rule, see section 7.3.13 (page 318) of (Hajek
and Havranek, 1978). Specifically, in our case this criterion can be
formulated as follows. Having a {fi} set with m elements, consider
m formulas $f_m$ of type (1), each one obtained by taking out one of
the m elements. If for at least one $f_m$ there is no v for which the
condition
$P((f_m = \mathrm{true}) \,\&\, (f = \mathrm{false}) \,\&\, (g = v)) > p^{(v)}_{mism}$
holds, then this rule (and all rules which include it) has redundant
conditions and should not be considered. The value of $p^{(v)}_{mism}$
can be obtained from the same formula (10) which was used for
$p^{(v)}$. The only difference is the value of the threshold $a_0$. For
mismatches, the threshold is usually chosen lower than $a_0$.
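The redundancy criterion of item 5 can be sketched as follows. This is
a minimal illustration, not the MLS implementation: the data layout (a
list of Boolean columns, one per condition) and the function name are
assumptions, and probabilities are estimated as simple frequencies.

```python
def has_redundant_condition(conditions, g, p_mism):
    """Check a rule with m conditions for redundancy.

    conditions: list of m lists of bools; conditions[k][i] tells whether
                observation i satisfies condition k.
    g:          list of bools, the consequent value of every observation.
    p_mism:     dict {True: threshold, False: threshold}.
    """
    n, m = len(g), len(conditions)
    for skip in range(m):
        counts = {True: 0, False: 0}
        for i in range(n):
            # f_m: conjunction of all conditions except the one taken out
            f_m = all(conditions[k][i] for k in range(m) if k != skip)
            f = f_m and conditions[skip][i]
            if f_m and not f:
                counts[g[i]] += 1  # mismatch with consequent value g[i]
        # Redundant if no v has enough mismatches for this f_m.
        if all(counts[v] / n <= p_mism[v] for v in (True, False)):
            return True
    return False

thresholds = {True: 0.1, False: 0.1}
c1 = [True, True, True, True, False, False]
c2 = [True, True, False, False, True, True]
g = [True, True, False, False, False, False]

print(has_redundant_condition([c1, c1], g, thresholds))  # True: same term twice
print(has_redundant_condition([c1, c2], g, thresholds))  # False
```

The first call reproduces the extreme case from the text, where the
same term is added twice and removing one copy produces no mismatches
at all.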

The five properties presented above allow us to build an algorithm of
polynomial complexity. This comes from the fact that we are interested
only in rules applicable at n = 0, which reduces the number of rules to
consider from 2 2 to 2 M and from the properties 4 and 5 which limit 
the maximal tree depth in a typical case. In the worst case the tree
depth is limited by the value of $M_{max}$ from item 1. The other
properties reduce the complexity further. This algorithm, which in the
typical case is of polynomial complexity in N and M, can be applied to
a variety of practical problems.

The algorithm can be applied to a brute force analysis of a problem
with a significant number of components. A sketch of the algorithm
is the following. All possible implication rules may be represented as
a tree. Each node has an antecedent index assigned to it, and every
node can be mapped to a {fi} set (by taking the indexes of this node
and all its ancestors). This means that if node A is an ancestor of
node B, then $f_B = f_A \,\&\, X$, where $f_A$ and $f_B$ are formulas of
type (1) obtained from the {fi} sets corresponding to nodes A and B
respectively. This allows us to implement the algorithm as a recursive
tree scan and to use the five properties above directly as indicators
that a branch does not contain a rule of the required quality. We
discuss different applications in section 7.
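The recursive scan can be sketched in Python as follows. This is a
simplified illustration under stated assumptions, not the MLS code:
the quality is assumed to have the linear form A * correctness +
(1 - A) * coverage (criterion (4) itself is defined earlier in the
paper), only the depth limit and properties 3 and 4 are used for
pruning, and the data layout and all names are hypothetical.

```python
def scan_rules(match, g, a0=0.77, A=0.75, max_depth=8):
    """Return accepted rules as (component indexes, predicted value, quality).

    match[i][k] is True when observation i agrees with the prediction
    point in antecedent component k; g[i] is the consequent value."""
    n, m = len(g), len(match[0])
    count_g = {v: sum(1 for x in g if x == v) for v in (True, False)}
    # Minimal joint probability from Eq. (10), one value per consequent v.
    p_min = {v: (a0 - A) * (count_g[v] / n) / (1.0 - A) for v in (True, False)}
    found = []

    def visit(indexes, rows):
        # rows: observations for which the conjunction of `indexes` is true
        if not rows:
            return
        viable, best = False, None
        for v in (True, False):
            hits = sum(1 for i in rows if g[i] == v)
            if hits / n > p_min[v]:
                viable = True
            if count_g[v]:
                correctness = hits / len(rows)
                coverage = hits / count_g[v]
                quality = A * correctness + (1.0 - A) * coverage  # assumed form
                if best is None or quality > best[0]:
                    best = (quality, v, correctness)
        if not viable:
            return  # property 4: coverage too low, prune the whole branch
        if indexes and best[0] >= a0:
            found.append((tuple(indexes), best[1], best[0]))
        if best[2] == 1.0 or len(indexes) >= max_depth:
            return  # property 3 (perfect correctness) or depth limit
        start = indexes[-1] + 1 if indexes else 0
        for k in range(start, m):  # children carry larger indexes only
            visit(indexes + [k], [i for i in rows if match[i][k]])

    visit([], list(range(n)))
    return sorted(found, key=lambda rule: -rule[2])

# Eight observations, three components; g is true when components 0 and 1 match.
match = [[a, b, c] for a in (True, False) for b in (True, False) for c in (True, False)]
g = [row[0] and row[1] for row in match]
print(scan_rules(match, g))  # [((0, 1), True, 1.0)]
```

Because every child only adds components with a larger index than its
parent, each {fi} set is visited once, and the rows of a child are
obtained by filtering the rows of its parent.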

6. Predictor: error estimation 

Estimation of predictor correctness usually involves building a global
rule on training data and then evaluating the quality of this rule on
testing data. While this approach works well for testing global rules,
it is not very convenient for local rules, because for every
prediction point we may have different local rules. It is useful to
know the quality of a local rule, but this information cannot be used
for error estimation at the other prediction points.

The best way to perform testing in such a case is to test the average
performance of the predictor. One may consider a predictor as some
kind of "global rule" and estimate its quality. The quality of such a
"global rule" is equivalent to the predictor average quality.

A common problem of error estimation is the limited number of
observations. Techniques such as bootstrap and cross-validation are
commonly used for performing error estimation with a limited number
of observations.

For local predictors, a leave-one-out type of cross-validation is very
promising when working with a limited number of observations. In this
type of testing a set with N - 1 observations is created, and this data
is used for predicting the value at the one left-out point, for which
the value of g is known. The procedure is repeated N times and the
average predictor performance is obtained. The non-stratification
problem of testing data mentioned in (Witten and Frank, 1999) (every
testing set has only one observation and therefore does not have the
right proportions of observations with different values of g) is much
less of an issue for local predictions than for global predictions,
because the predictor is specifically built to be applicable at the
point where it is tested.
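The leave-one-out procedure for a lazy predictor can be sketched as
follows. The predictor used here (majority vote among exactly matching
training observations) is only a stand-in chosen to keep the example
self-contained; it is not the MLS predictor, and all names are
hypothetical.

```python
from collections import Counter

def majority(values):
    """Most frequent value in a non-empty sequence."""
    return Counter(values).most_common(1)[0][0]

def exact_match_predictor(train_x, train_g, x):
    """Stand-in lazy predictor: vote among training rows equal to x."""
    matched = [g for row, g in zip(train_x, train_g) if row == x]
    return majority(matched if matched else train_g)

def leave_one_out_correctness(xs, gs, predict):
    """Rebuild the predictor N times, each time on the other N - 1 points."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_g = gs[:i] + gs[i + 1:]
        if predict(train_x, train_g, xs[i]) == gs[i]:
            hits += 1
    return hits / len(xs)

xs = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (0, 0)]
gs = [False, False, False, True, True, True, True]  # the last point is noise
print(leave_one_out_correctness(xs, gs, exact_match_predictor))  # 6/7
```

Only the noisy last observation is predicted incorrectly, so the
estimate is 6 correct predictions out of 7.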

In case we have plenty of data, we can estimate the predictor average
performance without leave-one-out cross-validation. The local nature
of the predictor should be taken into account when performing the
tests. Assume we have a training set of N observations and a testing
set of T observations. To determine the predictor average performance
we predict the value for every observation in the testing set using
all the observations from the training set. In total we run the
predictor T times (once for every observation from the testing set),
each time using the same training set with N observations, and
estimate the predictor average performance from these T predictor
runs. Predictor average correctness C is defined as:

$$ C = \sum_{j = \mathrm{true},\, \mathrm{false}} P_{jj} \qquad (11) $$

$$ P_{jk} = \frac{t\left((g = j) \,\&\, (g^{(predicted)} = k)\right)}{T} \qquad (12) $$

The probabilities in (11) are calculated in the testing space; the
value of $t((g = j) \,\&\, (g^{(predicted)} = k))$ is the number of tests
(in total there are T test runs) in which the actual value of the
consequent was j and the predicted value was k.

One of the problems with (11) and similar types of criteria is their
dependence on the unconditional probabilities of different outcomes of
g. For example, if we have an event which happens in 1 out of 100
cases, then a predictor predicting that the event will never happen
has 0.99 correctness. This high value of correctness is not a result
of predictor quality but of the distribution of g. One may use
information gain based criteria, but using several criteria
simultaneously complicates the analysis. This problem does not arise
when criterion (11) is used for relative comparison of different
predicting techniques on the same data, because in this case the
distribution of g is identical.
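The correctness (11) with probabilities (12), and its dependence on
the distribution of g, can be illustrated with a few lines of Python.
The function name is hypothetical; the data reproduces the example of
an event that happens in 1 out of 100 cases.

```python
def correctness(actual, predicted):
    """Return (C, P) where P[(j, k)] follows Eq. (12) and C follows Eq. (11)."""
    T = len(actual)
    counts = {(j, k): 0 for j in (True, False) for k in (True, False)}
    for j, k in zip(actual, predicted):
        counts[(j, k)] += 1
    P = {jk: c / T for jk, c in counts.items()}
    C = P[(True, True)] + P[(False, False)]
    return C, P

# The event happens once in 100 tests; the predictor always answers "false".
actual = [True] + [False] * 99
predicted = [False] * 100
C, P = correctness(actual, predicted)
print(C)                  # 0.99
print(P[(True, False)])   # 0.01: the only real event is missed
```

The predictor never detects the event, yet C = 0.99, which is why (11)
is used here only for comparing techniques on the same data.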



7. Results of data analysis 

The real algorithm has a number of features not presented in the basic
algorithm description given in sections 3, 4 and 5. These are some of
them:

1. The acceptance level $a_0$ is dynamically adjusted. First we set an
initial acceptance level. Then, during tree scanning, the required
acceptance level is automatically increased to $ka$ if we find a rule
with quality a such that $a_0 < ka$, i.e. we keep only rules with
quality better than the fraction k of the best rule quality; all rules
with quality below this value are pruned. The value k = 1 corresponds
to the case when only the best rule is accepted.

2. Dependent variables are also handled in a slightly more complex
way than described because of additional optimization.
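The dynamic adjustment of the acceptance level from item 1 can be
sketched as follows. The class and method names are hypothetical and
the bookkeeping is simplified; only the update rule (raise the level
to k times the quality of a newly found rule when that exceeds the
current level, and prune rules below it) is taken from the text.

```python
class AcceptanceLevel:
    """Keeps the accepted rules and the current acceptance level."""

    def __init__(self, a0, k):
        self.level = a0   # initial acceptance level a_0
        self.k = k        # fraction of the best quality to keep, 0 < k <= 1
        self.rules = []   # accepted (rule, quality) pairs

    def offer(self, rule, quality):
        """Accept or reject a rule; raise the level and prune if needed."""
        if quality < self.level:
            return False
        self.rules.append((rule, quality))
        if self.k * quality > self.level:
            self.level = self.k * quality
            self.rules = [(r, q) for r, q in self.rules if q >= self.level]
        return True

acc = AcceptanceLevel(a0=0.77, k=0.9)
acc.offer("A", 0.80)  # accepted, level stays at 0.77
acc.offer("B", 0.95)  # accepted, level rises to about 0.855, rule A is pruned
acc.offer("C", 0.84)  # rejected, below the new level
acc.offer("D", 0.90)  # accepted
print([name for name, _ in acc.rules])  # ['B', 'D']
```

With k = 1 the same class keeps only rules that tie with the best
quality found so far.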

These and other details, which are not described here, make the
algorithm practical. The algorithm was implemented in the MLS
(Massive Local Search) program, the complete source code of which is
available from (Malyshkin, 2000).

The following parameters were used during all trials. The parameter
A in quality criterion (4) was set to 0.75, making correctness more
important than coverage. The maximal tree depth $M_{max}$ was set to 8.
The minimal coverage $c_{min}$ was set to 0.08, i.e. we accept only
rules with quality better than the quality of a perfectly correct rule
covering 0.08 of positive samples (with the exception of Chess,
Mushroom and Spambase, for which $c_{min} = 0.17$ was used). The
minimal number of mismatches (item 5 in section 5) was also determined
on the basis of minimal coverage; the value of $c_{min}^{(mism)}$ was
set to 0.02 (with the exception of Chess, Mushroom and Spambase, for
which $c_{min}^{(mism)} = 0.1$ was used). This threshold stayed the
same during tree scanning and was not adjusted as it was for the
matches.
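The link between the minimal coverage and the acceptance level can be
made explicit with a small sketch. It assumes that criterion (4) has
the linear form A * correctness + (1 - A) * coverage; this form is an
assumption of the example (it is consistent with formula (10), but
criterion (4) is not restated here), and the function names are
hypothetical.

```python
def quality(correctness, coverage, A=0.75):
    """Assumed linear form of the quality criterion (4)."""
    return A * correctness + (1.0 - A) * coverage

def acceptance_level(c_min, A=0.75):
    """Quality of a perfectly correct rule that covers c_min of the samples."""
    return quality(1.0, c_min, A)

def min_joint_probability(a0, A, p_g_v):
    """Eq. (10)."""
    return (a0 - A) * p_g_v / (1.0 - A)

for c_min in (0.08, 0.17):
    a0 = acceptance_level(c_min)
    # Under the assumed form, Eq. (10) reduces to p^(v) = c_min * P(g = v).
    p = min_joint_probability(a0, 0.75, p_g_v=0.4)
    print(c_min, round(a0, 4), round(p, 4), round(c_min * 0.4, 4))
```

Under this assumption c_min = 0.08 corresponds to an acceptance level
of about 0.77 and c_min = 0.17 to about 0.7925.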

For non-ordered input variables, antecedent components were built as
the fact of an exact match of the variable value with its value at the
point to predict. For ordered input variables (with the exception of
Ionosphere and Spambase, for which we used exact match variables)
antecedent components were built as described in section 4. For
ordered literal variables we used all possible values as initial
levels $y_l$, $l = 1 \ldots L$. For continuous variables a
discretization was performed first to build ordered literal variables,
then the same attribute selection procedure used for ordered literal
variables was applied. The initial levels for continuous attributes
may be selected in a number of different ways; for sufficiently big L,
different supervised and unsupervised techniques give very similar
results for predictor quality. From a computational complexity point
of view it is good to have low values of L. The entropy based
discretization (Dougherty et al, 1995) gives a very good balance
between quality and the number of levels. In this work the entropy
based discretization from (Dougherty et al, 1995) was used for the
initial selection of levels $y_l$ for continuous variables, then the
procedure described in section 4 was applied. The utility we used is
available from (Kohavi et al, 1997).

The rich collection of data in the UCI repository (Blake and Merz,
1998) allows a comprehensive analysis on data from different domains.
Predictor correctness was estimated using 3-fold cross-validation with
stratification. The results were compared with those produced by the
widely used program C4.5R8 (Quinlan, 1992) with default settings. In
Table I we present the comparison of MLS with C4.5R8. For a
comprehensive comparison with other predicting techniques we refer to
(Lim et al, 2000; Zheng and Webb, 2000; Gama and Brazdil, 2000), where
a variety of predicting techniques were tested on the same data from
the UCI repository. The error estimates from these works can be
directly compared with those in Table I of this paper, which allows
our technique to be easily compared with the other predicting
techniques.

The exceptions mentioned above in algorithm parameter values for
some datasets (Chess, Mushroom and Spambase) were required to reduce
computation time. The higher the values of $c_{min}$ and
$c_{min}^{(mism)}$, the earlier the tree scanning algorithm reaches
its termination criteria.

The first column of Table I identifies the data set. The second and
third columns contain the predictor correctness C for C4.5R8 and our
program MLS respectively. The fourth column contains the total number
of observations. (These are needed for calculation of the correctness
error due to the finite number of test runs. For a given confidence
level and number of test runs, the lower boundary of C can be
estimated using a standard statistical technique (Schervish, 1995).
We do not present this analysis here because we are interested only
in the comparison of two predicting techniques.) The number of
antecedent variables is presented in the last column. This value is
for estimation of computational complexity. (Note that the number of
antecedent components is typically higher than the number of variables
because the methodology from section 4 usually gives several
antecedent components for a single variable.)

Table I. MLS and C4.5R8 performance comparison.

data            C4.5R8    MLS      observations   variables

Monk1           1.0       1.0      432            6
Monk2           0.65      0.71     432            6
Monk3           1.0       0.972    432            6
Breast-cancer   0.73      0.69     286            9
Chess           0.99      0.90     3196           36
Crx             0.82      0.86     690            15
Diabetes        0.74      0.78     768            8
Hepatitis       0.75      0.86     155            19
Horse-colic     0.8       0.81     368            22
Ionosphere      0.90      0.90     351            34
Labor-neg       0.72      0.81     57             16
Mushroom        1.0       0.96     8124           22
Pima            0.74      0.77     768            8
Spambase        0.92      0.87     4601           57
Tic-tac-toe     0.985     0.99     958            9
Vote            0.96      0.96     435            16

The trials usually run much faster in C4.5R8 than in MLS. First,
C4.5R8 is written in C while our program MLS is written in Java.
Second, we need to re-run the predictor for every test (lazy
learning), while C4.5R8 builds its model only once (eager learning).
This slowdown matters mainly during predictor testing: we are
especially interested in tasks where only a few predictions are
needed, rather than about as many predictions as there are training
observations. Third, massive search algorithms are generally slower
than decision tree "divide and conquer" algorithms. Despite running
more slowly, the proposed algorithm is fast enough to solve practical
problems.

Monk1, Monk2 and Monk3 are the problems usually tried first by
different predictor algorithms. From Table I it follows that on the
Monk tests MLS performs about the same as C4.5R8: identical on Monk1,
better on Monk2 and slightly worse on Monk3.

On Chess MLS performs noticeably worse than C4.5R8. This is because
the values of $c_{min}$ and $c_{min}^{(mism)}$ used for this trial
effectively reduce the maximal tree depth to a value of about 4. At
the same time, C4.5R8 generates a number of rules with more than 10
conditions. The Chess problem is an example of a problem for which the
massive search approach is not effective: a large number of attributes
produce complex rules. A similar effect (but to a much lower degree)
occurs in the Mushroom trial.

On Crx, Diabetes, Hepatitis, Horse-colic, Labor-neg, Pima, Tic-tac-toe
and Vote MLS performs about the same as or better than C4.5R8. These
trials also have a significant number of attributes, but the rules do
not have too many conditions, and the global optimization algorithm
easily catches the best rule(s) without any major slowdown in
calculations.

Our tests also show that in some trials different values of $c_{min}$
(minimal coverage) and of the parameter A from Eq. (4) (relative
importance of coverage and correctness; A = 0.75 in our trials) may
result in better correctness than presented in Table I. The higher
$c_{min}$ is, the higher is the quality required for a rule to be
accepted. As shown in section 5, the value of $c_{min}$ affects
computational complexity. Increasing $c_{min}$ and $c_{min}^{(mism)}$
decreases computational complexity by reducing the effective tree
scanning depth. We expect that automatic adjustment of the parameter A
and of the required minimal coverage $c_{min}$ based on available data
will noticeably improve MLS.

In addition to predictor quality on different datasets, we are also
interested in testing the effect of the attribute selection
methodology from section 4 on predictor quality. To test this we ran
the predictor twice on some datasets: the first time all antecedents
were selected as the fact of an exact match of the variable value with
its value at the point to predict, and the second time all antecedent
components (even if they correspond to non-ordered attributes) were
selected as a comparison with upper and lower levels in the way
described in section 4. Note that the former selection can always be
obtained from the latter one, because the condition r = a is the same
as $(r \le a) \,\&\, (r > a - 1)$ (here the variable r is assumed to
take integer values from an interval). This way we tested how the
quality of a predictor is affected by the increase of rule expressive
power when we go from "exact match" attributes to attributes built in
the way described in section 4. We performed this testing on five
datasets with ordered attributes (in Monk1, Monk2 and Monk3 the
structure of attribute values allows the variables to be considered as
ordered, and in Pima and Diabetes the attributes are ordered), and on
two datasets with literal non-ordered attributes (Tic-tac-toe and
Vote). The results are presented in Table II.
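The relation between the two kinds of attributes can be checked
directly. The sketch below, with hypothetical function names, verifies
for integer-valued r that an exact match is a special case of two level
comparisons (r <= a together with r > a - 1), and shows a level
condition that a single exact match cannot express.

```python
def exact_match(r, a):
    """'Exact match' attribute: r equals the value at the prediction point."""
    return r == a

def level_comparison(r, lower, upper):
    """'Levels comparison' attribute: lower < r <= upper."""
    return lower < r <= upper

values = range(0, 10)

# r = a is the same as (r <= a) & (r > a - 1) for integer r.
assert all(
    exact_match(r, a) == level_comparison(r, a - 1, a)
    for r in values for a in values
)

# A level condition such as 2 < r <= 6 selects several values at once,
# which would require a disjunction of exact matches.
print([r for r in values if level_comparison(r, 2, 6)])  # [3, 4, 5, 6]
```

This extra expressive power is what the second series of trials
measures.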

From these trials it follows that for datasets with ordered attributes
(Monk1, Monk2 and Monk3) the transition from "exact match" to
"levels comparison" may significantly increase predictor quality. In
Monk2 no rule was found for an exact match (because we require high
enough minimal coverage for a rule), but the increased expressive
power of the generated rules allows us to find high quality rules
which obey the condition of minimal coverage. At the same time, in
trials where antecedent attributes were obtained as a result of
entropy discretization (Diabetes and Pima) there is no strong effect
of automatic selection of upper/lower boundaries. Because of
computational complexity, for Ionosphere and Spambase we did only
"exact match" attribute selection trials, but preliminary results show
that "exact match" attributes may produce even better results than
"levels comparison".

Table II. Predictor results for "exact match" and
"levels comparison" type of attribute selection for
MLS.

data           Exact Match    Levels Comparison

Monk1          1.0            1.0
Monk2          -              0.71
Monk3          0.972          0.972
Pima           0.76           0.77
Diabetes       0.78           0.78
Tic-tac-toe    0.99           0.82
Vote           0.96           0.94

For the datasets with non-ordered attributes (Tic-tac-toe and Vote,
for which we forced non-ordered variables to be considered as ordered)
such a transition may either not affect or even decrease predictor
quality. This is because the increased expressive power of the rules
may cause an effect similar to data overfitting. The clearest example
is the Tic-tac-toe trial, where the global optimization algorithm
finds many "false rules", i.e. combinations of conditions which by
chance happened to give a high value of the quality criterion. The
number of such "false rules" can be significantly reduced by
increasing the value of minimal coverage.

From these trials it follows that the approach to antecedent at- 
tributes selection from section 4 may give better results only for ordered 
attributes, and even in this case an "exact match" of attributes may 
produce better results in some instances. 

The presented test trials show that the massive search algorithm
performs about the same as or better than C4.5R8 on many datasets.
We attribute this to global optimization. There are also cases when
MLS is less effective than decision tree "divide and conquer"
algorithms. This usually happens on datasets with a large number
of attributes producing complex rules.

The described approach shows that a massive local rules search global
optimization algorithm can be applied to problems with a significant
number of attributes. The computational complexity can be greatly
reduced by building rules which are specific to a prediction point and
by using the optimization technique described above. The massive
search algorithm is guaranteed to find the global maximum, which makes
it especially valuable for testing various predicting systems.

In this work we have shown that the process of attribute selection
can be integrated with the process of rules search. This allows
us to perform data analysis in a uniform way without separation of
the attribute selection and the rules search stages. From a fuzzy
logic point of view this may be considered as automatic selection of
levels in singleton Mamdani rules (Mamdani and Assilian, 1975). Such a
method of attribute selection usually allows us to build more
"expressive" rules. This is related to the fact that in many problems
the comparison of a value with a level is a natural method of
attribute selection.

Another distinctive feature of the proposed algorithm is its
conflicting rules resolution strategy. We accept a number of rules,
then build a single rule for prediction based on the accepted rules.
The quality of this single rule may significantly decrease if the
accepted rules predict different values of the consequent.

While the described approach is already practical and was applied
in the solution of a number of different problems, it can be further
improved. From our point of view there are two directions for
improvement. Firstly, the quality criterion (4) differs from commonly
used criteria. The major advantage of the criterion is that its
calculation can be optimized. The calculation of the probability of a
logical expression is a problem actively studied from a computational
complexity point of view, see (Abraham, 1979; Heidtmann, 1989;
Bertschy and Monney, 1996; Gorodetsky and Dubarenko, 1997; Anrig,
2000) and references therein. Because calculation of (4) is equivalent
to calculation of a probability, various optimizations used in
reliability theory (Limnios and Nikulin, 2000) can be applied.
Secondly, automatic selection of the minimal coverage $c_{min}$ and of
the relative importance of coverage and correctness A can be added to
the algorithm. A flexible selection of these parameters often improves
the results. These improvements, in our opinion, can further increase
the correctness and decrease the computational complexity of the
algorithm.



artl.tex; 1/02/2008; 21:29; p. 21 






Vladislav G. Malyshkin et al 

Acknowledgements 



Vladislav Malyshkin greatly appreciates Columbus Advisors LLC's sup- 
port for this study, especially the support from Emilio J. Lamar during 
Vladislav's employment with Columbus Advisors LLC. The authors 
would also like to thank Alexander Rybalov for many fruitful discus- 
sions. 



References 

Abraham, J. A. An improved algorithm for network reliability. IEEE
Transactions on Reliability, 28:58-61, 1979.

Aha, D. W., D. Kibler and M. K. Albert. Instance-based learning
algorithms. Machine Learning, 6:37-66, 1991.

Aha, D. W., editor. Lazy Learning. Kluwer Academic, 1997.

Anrig, B. A Generalization of the Algorithm of Abraham. In M. Nikulin
and N. Limnios, editors, Proceedings of MMR'2000, Second International
Conference on Mathematical Methods in Reliability, pp. 95-98,
Bordeaux, France, 2000.

Bertschy, R. and P. A. Monney. A generalization of the algorithm of
Heidtmann to non-monotone formulas. Journal of Computational and
Applied Mathematics, 76:55-76, 1996.

Blake, C. L. and C. J. Merz. Department of Information and Computer
Science, University of California at Irvine, Irvine, CA. The data is
available via anonymous ftp from
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/

Brownston, L., R. Farrell, E. Kant and N. Martin. Programming Expert
Systems in OPS5. Reading, MA, Addison-Wesley.

Carbonell, J. G., R. S. Michalski and T. M. Mitchell. An overview of
machine learning. In Michalski et al, editors, Machine Learning: An
Artificial Intelligence Approach, 1:3-24. Morgan Kaufmann, 1983.

Dobson, A. J. An Introduction to Generalized Linear Models. Chapman &
Hall, 1990.

Dougherty, J., R. Kohavi and M. Sahami. Supervised and unsupervised
discretization of continuous features. In Proceedings Thirteen
International Joint Conference on Artificial Intelligence,
pp. 1022-1027. San Francisco: Morgan Kaufmann.

Durkin, J. Expert Systems: Design and Development. Macmillan
Publishing, New York, 1994.

Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy.
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

Gama, J. and P. Brazdil. Cascade Generalization. Machine Learning,
3:315-343, 2000.

Gorodetsky, A. E. and V. V. Dubarenko. Method of approached
calculation of probability of complex logic functions used in
logic-probabilistic description of reliability of technical systems.
Proceedings of MMR'97, First International Conference on Mathematical
Methods in Reliability, Part 1, Bucharest, Romania, 1997.

Hajek, P. and T. Havranek. Mechanizing Hypothesis Formation.
Springer-Verlag, Berlin Heidelberg New York, 1978.






A Massive Local Rules Search Approach to the Classification Problem 






Heidtmann, K. D. Smaller sums of disjoint products by subproduct
inversion. IEEE Transactions on Reliability, 38(3):305-311, 1989.

Hellendoorn, H. and D. Driankov, editors. Fuzzy Model Identification.
Selected approaches. Springer-Verlag, Berlin Heidelberg New York,
1997.

Kohavi, R., C. Brunk, A. Kozlov, C. Kunz, D. Sommerfield and E. Eros.
MLC++. We used the MLC++ utility discretize, available from
http://www.sgi.com/tech/mlc/, to perform discretization. This utility
is written by J. Dougherty and R. Kohavi.

Lim, T., W. Loh and Y. Shih. A Comparison of Prediction Accuracy,
Complexity, and Training Time of Thirty-Three Old and New
Classification Algorithms. Machine Learning, 3:203-228, 2000.

Limnios, N. and M. Nikulin, editors. Recent Advances in Reliability
Theory. Birkhauser, Boston, Basel, Berlin, 2000.

Lucas, P. and L. Van Der Gaag. Principles of Expert Systems.
Wokingham, England, Addison-Wesley, 1991.

Lyashenko, N. N. Methods and Algorithms of Empirical Inference
(Metody i Algoritmy Induktivnogo Vyvoda). Preprint, in Russian.
Leningrad Institute of Informatic and Automatic, V. M. Ponomarev,
editor. Leningrad, 1989.

Malyshkin, V. G., R. Bakhramov and A. Gorodetsky. A Logical Approach
to Statistical Inference. In Proceedings of the Second International
Conference of Dynamic Object Logic-Linguistic Control, DOLL99,
A. E. Gorodetsky, editor, pp. 16-19, June 21-25, 1999, St. Petersburg,
Russia. This paper is also available from
http://www.polytechnik.com/machine_learning/papers/dolU999.pdf

Malyshkin, V. G. Website with source code and other information.
Available online from http://www.polytechnik.com/machine_learning/

Mamdani, E. H. and S. Assilian. An experiment in linguistic synthesis
with a fuzzy logic controller. International Journal of Man-Machine
Studies, 7:1-13, 1975.

Melli, G. A Lazy Model-Based Approach to On-Line Classification.
M.S. Thesis (senior supervisor J. Han), Simon Fraser University, 1998.

Mitchell, T. M. Machine Learning. New York, McGraw Hill, 1997.

Piatetsky-Shapiro, G. and W. J. Frawley. Knowledge Discovery in
Databases. AAAI/MIT Press, 1991.

Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1990. The latest version of the program is C4.5 R8, which is available
from http://www.cse.unsw.edu.au/~quinlan/

Quinlan, J. R. Combining instance-based and model-based learning. In
Proceedings of the Tenth International Conference on Machine Learning,
pp. 236-243, 1993.

Quinlan, J. R. Improved Use of Continuous Attributes in C4.5. Journal
of Artificial Intelligence Research, 4:77-90, 1996.

Riddle, P., R. Segal and O. Etzioni. Representation design and
brute-force induction in a Boeing manufacturing domain. Applied
Artificial Intelligence, 8:125-147, 1994.

van Rijsbergen, C. J. Information retrieval. London, Butterworths,
2nd Edition, 1979. This book is available online from
http://sherlock.berkeley.edu/IS205/IR_CJVR/

Schervish, M. J. Theory of Statistics. Springer Verlag, 2nd Edition,
1995.

Shannon, C. E. A Mathematical Theory of Communications. The Bell
System Technical Journal, 27:379-423, 27:623-656, 1948.

Shavlik, J. W. and T. G. Dietterich. Readings in Machine Learning.
Morgan Kaufmann, 1990.

Walker, R. C. Model Building in Mathematical Programming. Wiley, John
& Sons, 4th edition, 1999.




Witten, I. H. and E. Frank. Data Mining. Morgan-Kaufmann Publishers,
1999.

Zheng, Z. and G. I. Webb. Lazy Learning of Bayesian Rules. Machine
Learning, 1:53-84, 2000.






